Abstract

Existing statistical approaches to natural lan- 
guage problems are very coarse approximations 
to the true complexity of language processing. 
As such, no single technique will be best for 
all problem instances. Many researchers are 
examining ensemble methods that combine the 
output of successful, separately developed mod- 
ules to create more accurate solutions. This pa- 
per examines three merging rules for combin- 
ing probability distributions: the well known 
mixture rule, the logarithmic rule, and a novel 
product rule. These rules were applied with 
state-of-the-art results to two problems com- 
monly used to assess human mastery of lexi- 
cal semantics — synonym questions and analogy 
questions. All three merging rules result in en- 
sembles that are more accurate than any of their 
component modules. The differences among the 
three rules are not statistically significant, but 
it is suggestive that the popular mixture rule is 
not the best rule for either of the two problems. 



1 Introduction 

Asked to articulate the relationship between the 
words broad and road, you might consider a num- 
ber of possibilities. Orthographically, the second 
can be derived from the first by deleting the ini- 
tial letter, while semantically, the first can mod- 
ify the second to indicate above-average width. 
Many possible relationships would need to be con- 
sidered, depending on the context. In addition, 
many different computational approaches could 
be brought to bear, leaving a designer of a natu- 
ral language processing system with some difficult 
choices. A sound software engineering approach 
is to develop separate modules using independent 
strategies, then to combine the output of the mod- 
ules to produce a unified solver. 

The concrete problem treated here is predicting 
the correct answers to multiple-choice questions. 



Each instance consists of a context and a finite set 
of choices, one of which is correct. Modules pro- 
duce a probability distribution over the choices 
and a merging rule is used to combine these dis- 
tributions into one. This distribution, along with 
relevant utilities, can then be used to select a can- 
didate answer from the set of choices. The merg- 
ing rules we considered are parameterized, and 
we set parameters by a maximum likelihood ap- 
proach on a collection of training instances. 

Many problems can be cast in a multiple- 
choice framework, including optical digit recogni- 
tion (choices are the 10 digits), word sense disam- 
biguation (choices are a word's possible senses), 
text categorization (choices are the classes), and 
part-of-speech tagging (choices are the grammatical categories). This paper
looks at multiple-choice synonym questions (part of the Test of English as a
Foreign Language) and multiple-choice verbal analogy questions (part of the
SAT^). Recent work has demonstrated that algorithms for solving
multiple-choice synonym questions can be used to determine the semantic
orientation of a word; that is, whether the word conveys praise or criticism
(Turney and Littman, in press). Other research has shown that algorithms for
solving multiple-choice verbal analogy questions can be used to determine the
semantic relation in a noun-modifier expression; for example, in the
noun-modifier expression "laser printer", the modifier "laser" is an
instrument used by the noun "printer" (Turney and Littman, 2003).

^The College Board has announced that analogies will be eliminated from the
SAT in 2005 (http://www.collegeboard.com/about/newsat/newsat.html) as part of
a shift in the exam to reflect changes in the curriculum. The SAT was
introduced as the Scholastic Aptitude Test in 1926, its name was changed to
Scholastic Assessment Test in 1993, then changed to simply SAT in 1997.

The paper offers two main contributions. First, 
it introduces and evaluates several new modules 
for answering multiple-choice synonym questions 
and verbal analogy questions; these may be use- 
ful for solving problems in lexical semantics such 
as determining semantic orientation and seman- 
tic relations. Second, it presents a novel product 
rule for combining module outputs and compares 
it with other similar merging rules. 

Section 2 formalizes the problem addressed in this paper and introduces the
three merging rules we study in detail: the mixture rule, the logarithmic
rule, and the product rule. Section 3 presents empirical results on synonym
and analogy problems. Section 4 summarizes and wraps up.

2 Module Combination 

The following synonym question is a typical multiple-choice question:
hidden:: (a) laughable, (b) veiled, (c) ancient, (d) revealed. The stem,
hidden, is the question. There are k = 4 choices, and the question writer
asserts that exactly one (in this case, (b)) has the same meaning as the stem
word. The accuracy of a solver is measured by its fraction of correct answers
on a set of $\ell$ testing instances.

In our setup, knowledge about the multiple- 
choice task is encapsulated in a set of n mod- 
ules, each of which can take a question instance 
and return a probability distribution over the k 
choices. For a synonym task, one module might 
be a statistical approach that makes judgments 
based on analyses of word co-occurrence, while 
another might use a thesaurus to identify promis- 
ing candidates. These modules are applied to a 
training set of m instances, producing probabilistic "forecasts";
$p^h_{ij} \ge 0$ represents the probability assigned by module
$1 \le i \le n$ to choice $1 \le j \le k$ on training instance
$1 \le h \le m$. The estimated probabilities are distributions over the
choices for each module i on each instance h: $\sum_j p^h_{ij} = 1$.

2.1 Merging Rules 

The merging rules we considered are parameterized by a set of weights $w_i$,
one for each module. For a given merging rule, a setting of the weight vector
w induces a probability distribution over the choices for any instance. Let
$D^{h,w}_j$ be the probability assigned by the merging rule to choice j of
training instance h when the weights are set to w. Let $1 \le a(h) \le k$ be
the correct answer for instance h. We set weights to maximize the likelihood
of the training data: $w = \operatorname{argmax}_{w'} \prod_h D^{h,w'}_{a(h)}$.
The same weights maximize the mean likelihood, the geometric mean of the
probabilities assigned to correct answers.
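The two evaluation quantities used throughout the paper, accuracy and mean likelihood, can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the 0-based answer indices, and the toy distributions are assumptions made for the example, and ties are broken by taking the first maximal choice.

```python
import math

def mean_likelihood(merged, answers):
    """Geometric mean of the probabilities assigned to the correct answers.

    merged[h][j] is the merged probability of choice j on instance h;
    answers[h] is the 0-based index of the correct choice (an assumption
    of this sketch; the paper indexes choices from 1).
    """
    log_sum = sum(math.log(dist[a]) for dist, a in zip(merged, answers))
    return math.exp(log_sum / len(answers))

def accuracy(merged, answers):
    """Fraction of instances whose highest-probability choice is correct."""
    correct = 0
    for dist, a in zip(merged, answers):
        guess = max(range(len(dist)), key=lambda j: dist[j])
        correct += (guess == a)
    return correct / len(answers)

# Toy data: two 4-choice instances, both with correct answer at index 1.
merged = [[0.1, 0.7, 0.1, 0.1], [0.4, 0.3, 0.2, 0.1]]
answers = [1, 1]
ml = mean_likelihood(merged, answers)   # sqrt(0.7 * 0.3)
acc = accuracy(merged, answers)         # first instance right, second wrong
```

Because the logarithm is monotone, maximizing the product of the correct-answer probabilities and maximizing this geometric mean select the same weights.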

We focus on three merging rules in this paper. The mixture rule combines
module outputs using a weighted sum and can be written
$M^{h,w}_j = \sum_i w_i p^h_{ij}$, where
$D^{h,w}_j = M^{h,w}_j / \sum_{j'} M^{h,w}_{j'}$ is the probability assigned
to choice j of instance h and $0 \le w_i \le 1$. The rule can be justified by
assuming each instance's answer is generated by a single module chosen via
the distribution $w_i / \sum_{i'} w_{i'}$.

The logarithmic rule combines the logarithm of module outputs by
$L^{h,w}_j = \exp(\sum_i w_i \ln p^h_{ij}) = \prod_i (p^h_{ij})^{w_i}$, where
$D^{h,w}_j = L^{h,w}_j / \sum_{j'} L^{h,w}_{j'}$ is the probability the rule
assigns to choice j of instance h. The weight $w_i$ indicates how to scale
the module probabilities before they are combined multiplicatively. Note that
modules that output zero probabilities must be modified before this rule can
be used.

The product rule can be written in the form
$P^{h,w}_j = \prod_i (w_i p^h_{ij} + (1 - w_i)/k)$, where
$D^{h,w}_j = P^{h,w}_j / \sum_{j'} P^{h,w}_{j'}$ is the probability the rule
assigns to choice j. The weight $0 \le w_i \le 1$ indicates how module i's
output should be mixed with a uniform distribution (or a prior, more
generally) before outputs are combined multiplicatively. As with the mixture
and logarithmic rules, a module with a weight of zero has no influence on the
final assignment of probabilities. Note that the product and logarithmic
rules coincide when weights are all zeroes and ones, but differ in how
distributions are scaled for intermediate weights. We do not have strong
evidence that the difference is empirically significant.
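The three merging rules can be written compactly in code. The sketch below follows the formulas in this section for a single instance; the function names and the list-of-lists layout (`p[i][j]` for module i and choice j) are assumptions of the example rather than anything specified in the paper. The mixture rule as written requires at least one nonzero weight.

```python
import math

def _normalize(scores):
    """Scale non-negative scores so they sum to one."""
    total = sum(scores)
    return [s / total for s in scores]

def mixture_rule(p, w):
    """Weighted sum of module outputs, then normalization."""
    k = len(p[0])
    return _normalize([sum(w[i] * p[i][j] for i in range(len(p)))
                       for j in range(k)])

def logarithmic_rule(p, w):
    """Product of module outputs raised to their weights, then normalization."""
    k = len(p[0])
    return _normalize([math.prod(p[i][j] ** w[i] for i in range(len(p)))
                       for j in range(k)])

def product_rule(p, w):
    """Mix each module with the uniform distribution, multiply, normalize."""
    k = len(p[0])
    return _normalize([math.prod(w[i] * p[i][j] + (1 - w[i]) / k
                                 for i in range(len(p)))
                       for j in range(k)])

# Two hypothetical modules on one 4-choice instance.
p = [[0.7, 0.1, 0.1, 0.1],
     [0.4, 0.4, 0.1, 0.1]]

# With weights (1, 0) every rule reduces to the first module's output.
only_first = [rule(p, [1.0, 0.0])
              for rule in (mixture_rule, logarithmic_rule, product_rule)]

# With all weights equal to one, the logarithmic and product rules coincide.
log_all = logarithmic_rule(p, [1.0, 1.0])
prod_all = product_rule(p, [1.0, 1.0])
```

The last two lines illustrate the remark above that the product and logarithmic rules agree when every weight is zero or one.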



2.2 Derivation of Product Rule 

In this section, we provide a justification for com- 
bining distributions multiplicatively, as in both 
the product and logarithmic rules. Our analy- 
sis assumes modules are calibrated and indepen- 
dent. The output of a calibrated module can be 
treated as a valid probability distribution; for
example, of all the times the module outputs 0.8 for
a choice, 80% of these should be correct. Note 
that a uniform distribution — the output of any 
module when its weight is zero for both rules — is 
guaranteed to be calibrated because the output 
is always 1/k and 1/k of these will be correct. 
Modules are independent if their outputs are in- 
dependent given the correct answer. We next ar- 
gue that our parameterization of the product rule 
can compensate for a lack of calibration and in- 
dependence. 

Use of Weights. First, module weights can improve the calibration of the
module outputs. Consider a module i that assigns a probability of 1 to its
best guess and 0 to the other three choices. If the module is correct 85% of
the time, setting $w_i = 0.8$ in the product rule results in adjusting the
output of the module to be 85% for its best guess and 5% for each of the
lesser choices. This output is properly calibrated and also maximizes the
likelihood of the data.^

Second, consider the situation of two modules 
with identical outputs. Unless they are perfectly 
accurate, such modules are not independent and 
combining their outputs multiplicatively results 
in "double counting" the evidence. However, as- 
signing either module a weight of zero renders the 
modules independent. Once again, such a set- 
ting of the weights maximizes the likelihood of 
the data. 
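The calibration example above can be checked numerically. The sketch below recomputes the product-rule output for a weight of 0.8 and, assuming the smoothing constant 0.00001 mentioned in the footnote, solves for the logarithmic-rule weight that yields the same 85% output. Variable names are illustrative only.

```python
import math

k = 4            # choices per question
target = 0.85    # how often the module's best guess is correct

# Product rule with a single module: w * p + (1 - w) / k.
w_product = 0.8
best = w_product * 1.0 + (1 - w_product) / k    # probability for the best guess
other = w_product * 0.0 + (1 - w_product) / k   # probability for each other choice

# Logarithmic rule: the raw outputs (1, 0, 0, 0) are first smoothed by adding
# eps to each entry and renormalizing, so that no logarithm is infinite.
eps = 0.00001
p_best = (1 + eps) / (1 + k * eps)
p_other = eps / (1 + k * eps)

# Solve p_best**w / (p_best**w + (k - 1) * p_other**w) = target for w.
ratio = (1 - target) / ((k - 1) * target)       # required (p_other / p_best) ** w
w_log = math.log(ratio) / math.log(p_other / p_best)
log_best = p_best ** w_log / (p_best ** w_log + (k - 1) * p_other ** w_log)
```

Under these assumptions `w_log` comes out at roughly 0.2461, and it shifts if `eps` is changed.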

Multiplicative Combination. We now argue that independent, calibrated modules
should be combined multiplicatively. Let $A^h$ be the random variable
representing the correct answer to instance h. Let
$p^h_i = (p^h_{i1}, \ldots, p^h_{ik})$ be the output vector of module i on
instance h. We would like to compute the probability the correct answer is j
given the module outputs, $\Pr(A^h = j \mid p^h_1, \ldots, p^h_n)$, which we
can rewrite with Bayes rule as

$$\frac{\Pr(p^h_1, \ldots, p^h_n \mid A^h = j)\,\Pr(A^h = j)}{\Pr(p^h_1, \ldots, p^h_n)}. \tag{1}$$



^The logarithmic rule can also calibrate this module, as long as its output
is renormalized after adding a small constant, say, $\varepsilon = 0.00001$,
to avoid logarithms of $-\infty$. In this case, $w_i \approx .2461$ works,
although the appropriate weight varies with $\varepsilon$.



Assuming independence, and using Z as a normalization factor, Expression 1
can be decomposed into

$$\frac{\Pr(p^h_1 \mid A^h = j) \cdots \Pr(p^h_n \mid A^h = j)\,\Pr(A^h = j)}{Z}.$$

Applying Bayes rule to the individual factors, we get

$$\frac{\Pr(A^h = j \mid p^h_1) \cdots \Pr(A^h = j \mid p^h_n)}{\Pr(A^h = j)^{n-1}\,Z'} \tag{2}$$

by collecting constant factors into the normalization factor Z'. Using the
calibration assumption $\Pr(A^h = j \mid p^h_i) = p^h_{ij}$, Expression 2
simplifies to $\prod_i p^h_{ij} / (\Pr(A^h = j)^{n-1} Z')$. Finally, we
precisely recover the unweighted product rule using a final assumption of
uniform priors on the choices, $\Pr(A^h = j) = 1/k$, which is a natural
assumption for standardized tests.

2.3 Weight Optimization 

For the experiments reported here, we adopted a straightforward approach to
finding the weight vector w that maximizes the likelihood of the data. The
weight optimizer reads in the output of the modules^, chooses a random
starting point for the weights, then hillclimbs using an approximation of the
partial derivative. The entire optimization procedure is repeated 10 times
from a new random starting point to minimize the influence of local optima.
Although more sophisticated optimization algorithms are well known, we found
that the simple discrete gradient approach worked well for our application.
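A procedure of this kind might look like the following sketch for the product rule. It is an illustration under stated assumptions, not the authors' optimizer: the step-size schedule, the finite-difference width `delta`, the number of steps, the fixed seed, and the synthetic data are all invented for the example. Module outputs are smoothed with a small constant here only so that every logarithm in the sketch stays finite.

```python
import math
import random

def smooth(p, eps=0.00001):
    """Add a small constant to each output and renormalize."""
    q = [x + eps for x in p]
    total = sum(q)
    return [x / total for x in q]

def product_rule(p, w):
    k = len(p[0])
    scores = [math.prod(w[i] * p[i][j] + (1 - w[i]) / k for i in range(len(p)))
              for j in range(k)]
    total = sum(scores)
    return [s / total for s in scores]

def log_likelihood(w, data):
    """data is a list of (p, answer) pairs; answer is a 0-based choice index."""
    return sum(math.log(product_rule(p, w)[a]) for p, a in data)

def hillclimb(data, n_modules, restarts=10, steps=100, delta=1e-4, seed=0):
    rng = random.Random(seed)
    best_w, best_ll = None, -math.inf
    for _ in range(restarts):
        w = [rng.random() for _ in range(n_modules)]
        ll = log_likelihood(w, data)
        rate = 0.5
        for _ in range(steps):
            # Approximate each partial derivative by a finite difference,
            # keeping both evaluation points inside [0, 1].
            grad = []
            for i in range(n_modules):
                lo, hi = w[:], w[:]
                lo[i] = max(0.0, w[i] - delta)
                hi[i] = min(1.0, w[i] + delta)
                diff = log_likelihood(hi, data) - log_likelihood(lo, data)
                grad.append(diff / (hi[i] - lo[i]))
            cand = [min(1.0, max(0.0, wi + rate * g / len(data)))
                    for wi, g in zip(w, grad)]
            cand_ll = log_likelihood(cand, data)
            if cand_ll > ll:
                w, ll = cand, cand_ll      # accept the uphill move
            else:
                rate /= 2                  # step overshot; try a smaller one
        if ll > best_ll:
            best_w, best_ll = w, ll
    return best_w, best_ll

def one_hot(j, k=4):
    return [1.0 if c == j else 0.0 for c in range(k)]

# Synthetic training set: module 0 is right on 17 of 20 instances,
# module 1 returns the same distribution regardless of the question.
data = []
for h in range(20):
    answer = h % 4
    guess = answer if h >= 3 else (answer + 1) % 4
    module0 = smooth(one_hot(guess))
    module1 = smooth([0.4, 0.3, 0.2, 0.1])
    data.append(([module0, module1], answer))

best_w, best_ll = hillclimb(data, n_modules=2)
```

Each restart only ever accepts moves that increase the log-likelihood, so the value returned is at least as good as the best random starting point.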

2.4 Related Work 

Merging rules of various sorts have been studied 
for many years, and have gained prominence re- 
cently for natural language applications. 

Use of the mixture rule and its variations is quite common. Recent examples
include the work of Brill and Wu (1998) on part-of-speech tagging, Littman et
al. (2002) on crossword-puzzle clues and Florian and Yarowsky (2002) on a
word-sense disambiguation task. In all of these cases, the authors found that
the merged output was a significant improvement on that of the powerful
independently engineered component modules. We use the name "mixture rule" by
analogy to the mixture of experts model (Jacobs et al., 1991), which combined
expert opinions in an analogous way. In the forecasting literature, this rule
is also known as the linear opinion pool; Jacobs (1995) provides a summary of
the theory and applications of the mixture rule in this setting.

^For the reasons suggested in the previous footnote, for each question and
module, the optimizer adds 0.00001 to each output and renormalizes the
distribution (scales it to add to one). We found this necessary for both the
logarithmic and mixture rules, but not the product rule. Parameters were set
by informal experimentation, but the results did not seem to be sensitive to
their exact values.

The logarithmic opinion pool of Heskes (1998) is the basis for our
logarithmic rule. The paper argued that its form can be justified as an
optimal way to minimize Kullback-Leibler divergence between the output of an
ensemble of adaptive experts and target outputs. Boosting (Schapire, 1999)
also uses a logistic-regression-like rule to combine outputs of simple
modules to perform state-of-the-art classification. The product of experts
approach also combines distributions multiplicatively, and Hinton (1999)
argues that this is an improvement over the "vaguer" probability judgments
commonly resulting from the mixture rule. A survey by Xu et al. (1992)
includes the equal-weights version of the mixture rule and a derivation of
the unweighted product rule.

An important contribution of the current work 
is the product rule, which shares the simplicity 
of the mixture rule and the probabilistic justifica- 
tion of the logarithmic rule. We have not seen an 
analog of this rule in the forecasting or learning 
literatures. 

3 Experimental Results 

We applied the three merging rules to synonym 
and analogy problems, as described next. 

3.1 Synonyms 

We constructed a collection of 431 4-choice synonym questions^ and randomly
divided them into 331 training questions and 100 testing questions. We
created four modules, described next, and ran each module on the training
set. We used the results to set the weights for the mixture, logarithmic, and
product rules and evaluated the resulting synonym solver on the test set.

^Our synonym question set consisted of 80 TOEFL questions provided by ETS via
Thomas Landauer, 50 ESL questions created by Donna Tatsuki for Japanese ESL
students, 100 Reader's Digest Word Power questions gathered by Peter Turney,
Mario Jarmasz, and Tad Stach, and 201 synonym pairs and distractors drawn
from different sources including crossword puzzles by Jeffrey Bigham.

Module outputs, where applicable, were normalized to form a probability
distribution by scaling them to add to one before merging.

LSA. Following Landauer and Dumais (1997), we used latent semantic analysis
to recognize synonyms. Our LSA module queried the web interface developed at
the University of Colorado (http://lsa.colorado.edu), which has a
300-dimensional vector representation for each of tens of thousands of words.
The similarity of two words is measured by the cosine of the angle between
their corresponding vectors.

PMI-IR. Our Pointwise Mutual Information-Information Retrieval module used
the AltaVista search engine to determine the number of web pages that contain
the choice and stem in close proximity. PMI-IR used the third scoring method
(near each other, but not near not) designed by Turney (2001), since it
performed best in this earlier study.

Thesaurus. Our Thesaurus module also used 
the web to measure stem-choice similarity. The 
module queried the Wordsmyth thesaurus online 
at www.wordsmyth.net and collected any words listed
in the "Similar Words", "Synonyms", "Crossref. 
Syn.", and "Related Words" fields. The module 
created synonym lists for the stem and for each 
choice, then scored them according to their over- 
lap. 

Connector. Our Connector module used summary pages from querying Google
(google.com) with pairs of words to estimate stem-choice similarity (20
summaries for each query). It assigned a score to a pair of words by taking a
weighted sum of both the number of times they appear separated by one of the
symbols [, ", :, ,, =, /, \, (, ], "means", "defined", "equals", "synonym",
whitespace, and "and", and the number of times "dictionary" or "thesaurus"
appear anywhere in the Google summaries.

Results. Table 1 presents the result of training and testing each of the four
modules on synonym problems. The first four lines list the accuracy and mean
likelihood obtained using each module individually (using the product rule to
set the individual weight). The highest accuracy is that of the Thesaurus
module at 69.6%. All three merging rules were able to leverage the
combination of the modules to improve performance to roughly 80%,
statistically significantly better than the best individual module. It seems
this domain lends itself very well to an ensemble approach.

Synonym Solvers      Accuracy   Mean likelihood
LSA only             43.8%      .2669
PMI-IR only          69.0%      .2561
Thesaurus only       69.6%      .5399
Connector only       64.2%      .3757
All: mixture         80.2%      .5439
All: logarithmic     82.0%      .5977
All: product         80.0%      .5889

Table 1: Comparison of results for merging rules on synonym problems.

Reference             Accuracy   95% confidence
L & D (1997)          64.40%     52.90-74.80%
non-native speakers   64.50%     53.01-74.88%
Turney (2001)         73.75%     62.71-82.96%
J & S (2002)          78.75%     68.17-87.11%
T & C (2003)          81.25%     70.97-89.11%
Product rule          97.50%     91.26-99.70%

Table 2: Published TOEFL synonym results. Confidence intervals computed via
exact binomial distributions.
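The exact binomial confidence intervals reported in Table 2 can be reproduced with a short calculation. The sketch below assumes the intervals are two-sided exact (Clopper-Pearson) intervals at the 95% level and finds the endpoints by bisection; the function names are illustrative. The worked case uses 78 correct answers out of 80, which is what an accuracy of 97.50% on the 80 TOEFL questions implies.

```python
import math

def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(x + 1))

def _solve_increasing(g, iters=60):
    """Find p in [0, 1] with g(p) = 0, for g increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_interval(x, n, alpha=0.05):
    """Two-sided exact binomial interval for x successes in n trials."""
    # Lower end: the p at which seeing x or more successes has probability alpha/2.
    lower = 0.0 if x == 0 else _solve_increasing(
        lambda p: (1 - binom_cdf(x - 1, n, p)) - alpha / 2)
    # Upper end: the p at which seeing x or fewer successes has probability alpha/2.
    upper = 1.0 if x == n else _solve_increasing(
        lambda p: alpha / 2 - binom_cdf(x, n, p))
    return lower, upper

low, high = exact_interval(78, 80)   # approximately (0.9126, 0.9970)
```

The result agrees with the "Product rule" row of Table 2 to the precision shown there.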

Although the accuracies of the merging rules 
are nearly identical, the product and logarith- 
mic rules assign higher probabilities to correct 
answers, as evidenced by the mean likelihood. 
To illustrate the decision-theoretic implications of 
this difference, imagine the probability judgments 
were used in a system that receives a score of +1 
for each right answer and —1/2 for each wrong 
answer, but can skip questions.^ In this case, the 
system should make a guess whenever the highest 
probability choice is above 1/3. For the test ques- 
tions, this translates to scores of 71.0 and 73.0 for 
the product and logarithmic rules, but only 57.5 
for the mixture rule; it skips many more questions 
because it is insufficiently certain. 
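The guessing threshold follows from comparing the expected score of a guess with the score of zero for skipping: with top probability p and penalty c, guessing pays off when p - c(1 - p) > 0, that is, when p > c / (1 + c). The sketch below applies this to a few made-up distributions; the function names and data are assumptions of the example.

```python
def guess_threshold(penalty):
    """Top probability above which guessing has positive expected score.

    Expected score of a guess is p - penalty * (1 - p), which is positive
    exactly when p > penalty / (1 + penalty).
    """
    return penalty / (1 + penalty)

def penalized_score(merged, answers, penalty=0.5):
    """+1 per right answer, -penalty per wrong answer, 0 for a skip."""
    threshold = guess_threshold(penalty)
    score = 0.0
    for dist, a in zip(merged, answers):
        guess = max(range(len(dist)), key=lambda j: dist[j])
        if dist[guess] <= threshold:
            continue                      # not confident enough: skip
        score += 1.0 if guess == a else -penalty
    return score

# Three hypothetical 4-choice instances: one right, one skipped, one wrong.
merged = [[0.7, 0.1, 0.1, 0.1],
          [0.3, 0.3, 0.2, 0.2],
          [0.1, 0.6, 0.2, 0.1]]
answers = [0, 0, 2]
total = penalized_score(merged, answers)   # 1.0 + 0.0 - 0.5
```

With a penalty of 1/2 the threshold is 1/3, as stated above; with a penalty of 1/(k - 1) it becomes 1/k, so a uniform guess and a skip have the same expected score.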

Related Work and Discussion. 

Landauer and Dumais (1997) introduced the
Test of English as a Foreign Language (TOEFL)
synonym task as a way of assessing the accuracy 
of a learned representation of lexical semantics. 
Several studies have since used the same data set 



^The penalty value of -1/2 was chosen to illustrate this point. Standardized
tests often use a penalty of -1/(k - 1), which grants random guessing and
skipping equal utility.



for direct comparability; Table 2 presents these results.

The accuracy of LSA (Landauer and Dumais, 1997) is statistically
indistinguishable from that of a population of non-native English speakers on
the same questions. PMI-IR (Turney, 2001) performed better, but the
difference is not statistically significant. Jarmasz and Szpakowicz (in
press, 2003) give results for a number of relatively sophisticated
thesaurus-based methods that looked at path length between words in the
heading classifications of Roget's Thesaurus. Their best scoring method was a
statistically significant improvement over the LSA results, but not over
those of PMI-IR. Terra and Clarke (2003) studied a variety of corpus-based
similarity metrics and measures of context and achieved a statistical tie
with PMI-IR and the results from Roget's Thesaurus.

To compare directly to these results, we re- 
moved the 80 TOEFL instances from our collec- 
tion and used the other 351 instances for train- 
ing the product rule. Unlike the previous stud- 
ies, we used training data to set the parameters 
of our method instead of selecting the best scor- 
ing method post hoc. The resulting accuracy 
was statistically significantly better than all previ- 
ously published results, even though the individ- 
ual modules performed nearly identically to their 
published counterparts. In addition, it is not pos- 
sible to do significantly better than the product 
rule on this dataset, according to the Fisher Ex- 
act test. This means that the TOEFL test set 
is a "solved" problem — future studies along these 
lines will need to use a more challenging set of 
questions to show an improvement over our re- 
sults. 

3.2 Analogies 

Synonym questions are unique because of the ex- 
istence of thesauri — reference books designed pre- 
cisely to answer queries of this form. The re- 
lationships exemplified in analogy questions are 
quite a bit more varied and are not systemati- 
cally compiled. For example, the analogy ques- 
tion cat:meow:: (a) mouse:scamper, (b) bird:peck, 
(c) dog:bark, (d) horse:groom, (e) lion:scratch re- 
quires that the reader recognize that (c) is the an- 
swer because both (c) and the stem are examples 
of the relation "X is the name of the sound made 
by Y". This type of common sense knowledge is



rarely explicitly documented. 

In addition to the computational challenge it presents, analogical reasoning
is recognized as an important component in cognition, including language
comprehension (Lakoff and Johnson, 1980) and high level perception (Chalmers
et al., 1992). French (2002) surveys computational approaches to analogy
making.

To study module merging for analogy prob- 
lems, we collected 374 5-choice instances.^ We 
randomly split the collection into 274 training in- 
stances and 100 testing instances. 

We next describe the novel modules we devel- 
oped for attacking analogy problems and present 
their results. 

Phrase Vectors. We wish to score candidate analogies of the form A:B::C:D (A
is to B as C is to D). The quality of a candidate analogy depends on the
similarity of the relation $R_1$ between A and B to the relation $R_2$
between C and D. The relations $R_1$ and $R_2$ are not given to us; the task
is to infer these relations automatically. One approach to this task is to
create vectors $r_1$ and $r_2$ that represent features of $R_1$ and $R_2$,
and then measure the similarity of $R_1$ and $R_2$ by the cosine of the angle
between the vectors: $r_1 \cdot r_2 / \sqrt{(r_1 \cdot r_1)(r_2 \cdot r_2)}$.

We create a vector, r, to characterize the rela- 
tionship between two words, X and Y, by counting 
the frequencies of 128 different short phrases con- 
taining X and Y. Phrases include "X for Y", "Y 
with X", "X in the Y", and "Y on X". We use 
these phrases as queries to AltaVista and record 
the number of hits (matching web pages) for each 
query. This process yields a vector of 128 numbers 
for a pair of words X and Y. In experiments with 
our development set, we found that accuracy of 
this approach to scoring analogies improves when 
we use the logarithm of the frequency. The re- 
sulting vector r is a kind of signature of the rela- 
tionship between X and Y. 
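The scoring step can be sketched as a cosine between log-transformed hit-count vectors. Everything concrete below is hypothetical: the four phrase patterns stand in for the 128 used in the paper, the hit counts are invented rather than retrieved from a search engine, and taking log(count + 1) so that zero counts are defined is an assumption of this sketch (the text says only that the logarithm of the frequency is used).

```python
import math

def relation_vector(hit_counts):
    """Log-transformed signature of a word pair (log(c + 1) is an assumption)."""
    return [math.log(c + 1) for c in hit_counts]

def cosine(r1, r2):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(r1, r2))
    norm = math.sqrt(sum(a * a for a in r1) * sum(b * b for b in r2))
    return dot / norm

# Invented hit counts for four patterns:
# "X in the Y", "Y with X", "Y on X", "X for Y".
traffic_street = relation_vector([1200, 300, 0, 5])
water_riverbed = relation_vector([400, 90, 0, 0])
unrelated_pair = relation_vector([0, 2, 500, 700])

similar = cosine(traffic_street, water_riverbed)
dissimilar = cosine(traffic_street, unrelated_pair)
```

With these made-up counts the first two pairs share the same dominant patterns, so their cosine is much larger than the cosine with the third pair.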

For example, consider the analogy traffic:street::water:riverbed. The words
traffic and street tend to appear together in phrases such as "traffic in the
street" and "street with traffic", but not in phrases such as "street on
traffic" or "traffic for street". Similarly, water and riverbed may appear
together as "water in the riverbed", but "riverbed on water" would be
uncommon. Therefore, the cosine of the angle between the 128-vector $r_1$ for
traffic:street and the 128-vector $r_2$ for water:riverbed would likely be
relatively large.

^Our analogy question set was constructed by the authors from books and web
sites intended for students preparing for the SAT, including 90 questions
from unofficial SAT-prep websites, 14 questions from ETS's web site, 190
questions scanned in from a book with actual SAT exams, and 80 questions
typed from SAT guidebooks.

Thesaurus Paths. Another way to characterize the semantic relationship, R,
between two words, X and Y, is to find a path through a thesaurus or
dictionary that connects X to Y or Y to X.

In our experiments, we used the WordNet thesaurus
(Fellbaum, 1998). We view WordNet as a
directed graph and the Thesaurus Paths module
performed a breadth-first search for paths from X 
to Y or Y to X. The directed graph has six kinds 
of links, hypernym, hyponym, synonym, antonym, 
stem, and gloss. For a given pair of words, X and 
Y, the module considers all shortest paths in ei- 
ther direction up to three links. It scores the can- 
didate analogy by the maximum degree of similar- 
ity between any path for A and B and any path for 
C and D. The degree of similarity between paths 
is measured by their number of shared features: 
types of links, direction of the links, and shared 
words. 

For example, consider the analogy defined by 
evaporate:vapor::petrify:stone. The most similar 
pair of paths is: 

evaporate → (gloss: change into a vapor) vapor
and petrify → (gloss: change into stone) stone.
These paths go in the same direction (from first 
to second word), they have the same type of links 
(gloss links), and they share words (change and 
into). Thus, this pairing would likely receive a 
high score. 

Lexical Relation Modules. We implemented a 
set of more specific modules using the WordNet 
thesaurus. Each module checks if the stem words 
match a particular relationship in the database. 
If they do not, the module returns the uniform 
distribution. Otherwise, it checks each choice pair 
and eliminates those that do not match. The rela- 
tions tested are: Synonym, Antonym, Hypernym, 
Hyponym, Meronym:substance, Meronym:part, 
Meronym:member, Holonym:substance, and also 
Holonym:member. These modules use some 
heuristics including a simple kind of lemmatiza- 
tion and synonym expansion to make matching 
more robust. 
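The behaviour of one such module can be sketched as a filter over the choices. In the sketch below the `related` predicate is a stand-in for a lexical-database lookup, the toy antonym table is invented, and two details are assumptions not stated in the text: the surviving choices share the probability equally, and the module falls back to the uniform distribution if no choice survives.

```python
def lexical_relation_module(stem, choices, related):
    """Return a distribution over choices for one lexical relation.

    stem and each choice are (word, word) pairs; related(a, b) -> bool is a
    hypothetical stand-in for checking the relation in a lexical database.
    """
    k = len(choices)
    uniform = [1.0 / k] * k
    if not related(*stem):
        return uniform                   # relation does not apply to the stem
    keep = [related(a, b) for a, b in choices]
    if not any(keep):
        return uniform                   # assumption: nothing survives, stay neutral
    share = 1.0 / sum(keep)              # assumption: survivors split the mass equally
    return [share if kept else 0.0 for kept in keep]

# Toy antonym table in place of a real database.
ANTONYMS = {("hot", "cold"), ("up", "down"), ("big", "small")}

def is_antonym(a, b):
    return (a, b) in ANTONYMS or (b, a) in ANTONYMS

choices = [("up", "down"), ("cat", "dog"), ("big", "small"),
           ("red", "blue"), ("run", "walk")]
matched = lexical_relation_module(("hot", "cold"), choices, is_antonym)
unmatched = lexical_relation_module(("cat", "meow"), choices, is_antonym)
```

Because eliminated choices receive probability zero, outputs like these would need smoothing before being combined with the logarithmic rule, as noted earlier for zero-probability modules.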

Similarity. Dictionaries are a natural source to use for solving analogies
because definitions can express many possible relationships and are likely to
make the relationships more explicit than they would be in general text. We
implemented two definition similarity modules: Similarity:dict uses
Dictionary.com for definitions and Similarity:wordsmyth uses Wordsmyth.net.
Each module treats a word as a vector formed from the words in its
definition. Given a potential analogy A:B::C:D, the module computes a vector
similarity of the first words (A and C) and adds it to the vector similarity
of the second words (B and D).

Analogy Solvers        Accuracy   Mean likelihood
Phrase Vectors         38.2%      .2285
Thesaurus Paths        25.0%      .1977
Synonym                20.7%      .1890
Antonym                24.0%      .2142
Hypernym               22.7%      .1956
Hyponym                24.9%      .2030
Meronym:substance      20.0%      .2000
Meronym:part           20.8%      .2000
Meronym:member         20.0%      .2000
Holonym:substance      20.0%      .2000
Holonym:member         20.0%      .2000
Similarity:dict        18.0%      .2000
Similarity:wordsmyth   29.4%      .2058
all: mixture           42.0%      .2370
all: logarithmic       43.0%      .2354
all: product           45.0%      .2512
no PV: mixture         31.0%      .2135
no PV: logarithmic     30.0%      .2063
no PV: product         37.0%      .2207

Table 3: Comparison of results for merging rules on analogy problems.

Results. We ran the 13 modules described above on our set of training and
testing analogy instances, with the results appearing in Table 3 (the product
rule was used to set weights for computing individual module mean
likelihoods). For the most part, individual module accuracy is near chance
level (20%), although this is misleading because most of these modules only
return answers for a small subset of instances. Some modules did not answer a
single question on the test set. The most accurate individual module was the
search-engine-based Phrase Vectors (PV) module. The result of merging all
modules was only a slight improvement over PV alone, so we examined the
effect of retraining without the PV module. The product rule resulted in a
large improvement (though not statistically significant) over the best
remaining individual module (37.0% vs. 29.4% for Similarity:wordsmyth).

We once again examined the result of deduct- 
ing 1/2 point for each wrong answer. The full 
set of modules scored 31, 33, and 43 using the 
mixture, logarithmic, and product rules. As in 
the synonym problems, the logarithmic and prod- 
uct rules assigned probabilities more precisely. In 
this case, the product rule appears to have a ma- 
jor advantage, although this might be due to the 
particulars of the modules we used in this test. 

The TOEFL synonym problems proved fruitful in spurring research into
computational approaches to lexical semantics. We believe attacking analogy
problems could serve the research community even better, and have created a
set of 10 previously published SAT analogy problems (Claman, 2000). Our best
analogy solver from the previous experiment has an accuracy of 55.0% on this
test set.^ We hope to inspire others to use the same set of instances in
future work.

4 Conclusion 

We applied three trained merging rules to a set 
of multiple-choice problems and found all were 
able to produce state-of-the-art performance on a 
standardized synonym task by combining four less 
accurate modules. Although all three rules pro- 
duced comparable accuracy, the popular mixture 
rule was consistently weaker than the logarithmic 
and product rules at assigning high probabilities 
to correct answers. We provided first results on a 
challenging verbal analogy task with a set of novel 
modules that use both lexical databases and sta- 
tistical information. 

In nearly all the tests that we ran, the logarithmic rule and our novel
product rule behaved similarly, with a hint of an advantage for the product
rule. One point in favor of the logarithmic rule is that it has been better
studied so its theoretical properties are better understood. It also is able
to "sharpen" probability distributions, which the product rule cannot do
without removing the upper bound on weights. On the other hand, the product
rule is simpler, executes much more rapidly (8 times faster in our
experiments), and is more robust in the face of modules returning zero
probabilities. We feel the strong showing of the product rule on lexical
multiple-choice problems proves it worthy of further study.

^Although less accurate than our synonym solver, the analogy solver is
similar in that it excludes 3 of the 5 choices for each instance, on average,
while the synonym solver excludes roughly 3 of the 4 choices for each
instance. Note also that an accuracy of 55% approximately corresponds to the
mean verbal SAT score for college-bound seniors in 2002 (Turney and Littman,
2003).